The Search for a Cost Matrix to Solve Rare-Class Biological Problems

Author

  • Mark J. Lawson
Abstract

The rare-class data classification problem is a common one. It occurs when the class of interest in a dataset is far outweighed by the other classes, which makes it difficult to classify using typical classification algorithms. These types of problems are found quite often in biological datasets, where data can be sparse and the class of interest has few representatives. A variety of solutions to this problem exist, with varying degrees of success. In this paper, we present our solution to the rare-class problem. This solution uses MetaCost, a cost-sensitive meta-classifier that takes in a classification algorithm, training data, and a cost matrix. The cost matrix adjusts the learning of the classification algorithm so that it classifies more of the rare-class data, but the matrix is generally unknown for a given dataset and classifier. Our method uses three different optimization techniques (greedy search, simulated annealing, and a genetic algorithm) to determine the optimal cost matrix. We show how this method can improve classification on a large number of datasets, achieving better results across a variety of metrics. We also show that it can improve different classification algorithms, and that it does so better and more consistently than other rare-class learning techniques such as oversampling and undersampling. Overall, our method is a robust and effective solution to the rare-class problem.

Dedication

To my friends and family, who have supported me throughout
To Liqing, the best advisor a grad student could have
To Christina, my love and inspiration
GO HOKIES!

Acknowledgments

I would like to acknowledge all of the professors of my committee for their guidance and assistance in the creation of this dissertation.
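The cost-matrix idea described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation and not the full MetaCost procedure: it shows only the minimum-expected-cost decision rule that a cost matrix induces, plus a simple greedy search over a single false-negative cost. The function names, the choice of rare-class F1 as the objective, the step size, and the stopping rule are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def min_cost_class(probs: Sequence[float], cost: Sequence[Sequence[float]]) -> int:
    """Return the class with the lowest expected cost.

    cost[i][j] is the cost of predicting class i when the true class is j.
    probs[j] is the estimated probability that the true class is j.
    """
    expected = [
        sum(cost[i][j] * probs[j] for j in range(len(probs)))
        for i in range(len(cost))
    ]
    return min(range(len(expected)), key=expected.__getitem__)


def f1_rare(y_true: Sequence[int], y_pred: Sequence[int], rare: int = 1) -> float:
    """F1 score computed for the rare class only."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == rare and p == rare)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != rare and p == rare)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == rare and p != rare)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def greedy_cost_search(
    probs_list: List[Sequence[float]],
    y_true: Sequence[int],
    start: float = 1.0,
    step: float = 1.0,
    max_cost: float = 50.0,
) -> Tuple[float, float]:
    """Raise the false-negative cost while rare-class F1 keeps improving.

    Assumption: binary problem, class 0 is common, class 1 is rare, and the
    false-positive cost is fixed at 1. Stops at the first non-improving step.
    """
    def score(fn_cost: float) -> float:
        cost = [[0.0, fn_cost], [1.0, 0.0]]
        y_pred = [min_cost_class(p, cost) for p in probs_list]
        return f1_rare(y_true, y_pred)

    best_cost, best_score = start, score(start)
    candidate = start + step
    while candidate <= max_cost:
        s = score(candidate)
        if s <= best_score:
            break  # greedy: stop as soon as the score no longer improves
        best_cost, best_score = candidate, s
        candidate += step
    return best_cost, best_score


if __name__ == "__main__":
    # Toy probability estimates (p_common, p_rare); purely illustrative.
    p_rare = [0.4, 0.3, 0.18, 0.22, 0.1, 0.1, 0.05, 0.05]
    probs = [(1.0 - p, p) for p in p_rare]
    labels = [1, 1, 1, 0, 0, 0, 0, 0]
    print(greedy_cost_search(probs, labels))
```

With equal costs the rule never predicts the rare class on this toy data, because no example has a rare-class probability above 0.5. Raising the false-negative cost lowers the effective decision threshold, which is the mechanism a tuned cost matrix relies on. A greedy search like this can stop at a local optimum, which is one motivation for also trying simulated annealing or a genetic algorithm.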

Similar resources

Application of Tabu Search to a Special Class of Multicommodity Distribution Systems

The multicommodity distribution problem is one of the most interesting and useful models in mathematical programming because of its major role in distribution networks. The purpose of this paper is to describe and solve a special class of multicommodity distribution problems in which the shipment of a commodity from a plant to a customer goes through different distribution centers. The problem is t...

A Projected Alternating Least square Approach for Computation of Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) is a common method in data mining that has been used in different applications as a dimension reduction, classification, or clustering method. Methods based on the alternating least squares (ALS) approach are usually used to solve this non-convex minimization problem. At each step of an ALS algorithm, two convex least squares problems must be solved, which causes high com...

Optimizing a Cost Matrix to Solve Rare-Class Biological Problems

In a binary dataset, a rare-class problem occurs when one class of data (typically the class of interest) is far outweighed by the other. Such a problem is typically difficult to learn and classify and is quite common, especially among biological problems such as the identification of gene conversions. A multitude of solutions for this problem exist with varying levels of success. In this paper...

A Free Line Search Steepest Descent Method for Solving Unconstrained Optimization Problems

In this paper, we solve unconstrained optimization problems using a free line search steepest descent method. First, we propose a double-parameter scaled quasi-Newton formula for calculating an approximation of the Hessian matrix. The approximation obtained from this formula is a positive definite matrix that satisfies the standard secant relation. We also show that the largest eigen value...

Publication date: 2009